Pivoting data from wide to long to run many models at once
The tidymodels ecosystem advocates a nest-map-unnest workflow for running many models at once. While this is typically applied across some grouping variable, as seen in the tidymodels docs, we can also use it to run many models from a wide dataset.
Our goal is to get all of the measures into a long-form tibble so that we can fit all of the models at once and plot everything at once.
This basic example is borrowed directly from the tidymodels docs.
First, nest the data by a grouping variable to get a list-column containing each group's split-out tibble.
mtcars <- as_tibble(mtcars) # to play nicely with list-cols
nest_mtcars <- mtcars %>%
  nest(data = c(-am))
nest_mtcars
# A tibble: 2 x 2
am data
<dbl> <list>
1 1 <tibble [13 × 10]>
2 0 <tibble [19 × 10]>
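If you want to peek inside a nested group, each element of the data list-column is itself a regular tibble. A quick self-contained sketch, rebuilding the same mtcars nesting as above:

```r
library(dplyr)
library(tidyr)

# same nesting as above: one row per value of am
nest_mtcars <- as_tibble(mtcars) %>%
  nest(data = c(-am))

# each element of the list-column is the full tibble for that group
first_group <- nest_mtcars$data[[1]]
first_group
# a 13 x 10 tibble: the am = 1 cars, minus the am column itself
```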
Now you can apply an lm() call with purrr::map() to each dataset, and then broom::tidy() the model output!
nest_mtcars %>%
  mutate(
    fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)), # S3 list-col
    tidied = map(fit, tidy)
  ) %>%
  unnest(tidied) %>%
  select(-data, -fit)
# A tibble: 8 x 6
am term estimate std.error statistic p.value
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 4.28 3.46 1.24 0.247
2 1 mpg -0.101 0.0294 -3.43 0.00750
3 1 qsec 0.0398 0.151 0.264 0.798
4 1 gear -0.0229 0.349 -0.0656 0.949
5 0 (Intercept) 4.92 1.40 3.52 0.00309
6 0 mpg -0.192 0.0443 -4.33 0.000591
7 0 qsec 0.0919 0.0983 0.935 0.365
8 0 gear 0.147 0.368 0.398 0.696
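Because the result above is just a regular tibble, downstream manipulation is trivial. As a quick illustration (rebuilding the same pipeline so the chunk stands alone), we can filter the coefficient table to terms significant at the 5% level:

```r
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

coef_table <- as_tibble(mtcars) %>%
  nest(data = c(-am)) %>%
  mutate(
    fit = map(data, ~ lm(wt ~ mpg + qsec + gear, data = .x)),
    tidied = map(fit, tidy)
  ) %>%
  unnest(tidied) %>%
  select(-data, -fit)

# keep only terms significant at the 5% level
sig_terms <- coef_table %>%
  filter(p.value < 0.05)
sig_terms
# mpg is significant in both groups; the intercept only for am = 0
```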
Now each of the model coefficients for automatic vs. manual transmissions (0 vs. 1) is easy to work with! We'll use the same nest-map-unnest approach, albeit with a slightly different data structure, for the following analysis.
We'll be performing a similar analysis to what Josh Hermsmeyer did in his 538 article. The raw code for his analysis is available on his GitHub. Essentially, he evaluated how stable metrics were year over year by comparing:
* Metric in Year N
* Metric in Year N + 1
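The Year N / Year N + 1 pairing can be built with a self-join: shift each season forward by one and join the data back onto itself. A minimal sketch with hypothetical toy data:

```r
library(dplyr)

# hypothetical toy data: one metric for one team across three seasons
toy <- tibble(
  team   = "A",
  season = 2000:2002,
  value  = c(10, 12, 11)
)

# pair each season's value (value.x) with the next season's (value.y)
paired <- toy %>%
  mutate(season2 = season + 1) %>%
  inner_join(toy, by = c("season2" = "season", "team"))

paired
# 2 rows: 2000 -> 2001 and 2001 -> 2002; 2002 has no partner
```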
We're skipping most of the nflscrapR aggregation (see the link for the full script); the portion we can change a bit to make our lives easier is the repeated modeling.
Rather than generating a model for every metric one by one, we can generate models for ALL the metrics in the dataset at once, while still fitting each model on just one metric against the following year's value of that metric.
### QB Hits model ------------------------------------------------------------
qb_hits_model <- lm(data = team_defense_joined, qb_hit.y ~ qb_hit.x)
qb_hits_stability <- glance(qb_hits_model) %>%
  mutate(metric = "QB Hits",
         r.squared = round(r.squared, 3)) %>%
  select(metric, r.squared)
So let’s load our libraries and get to modeling!
library(tidyverse) # all the things
library(espnscrapeR) # NFL data summaries
library(broom) # tidy modeling
library(glue) # nice string creation
First we’ll get all the data from the 2000-2019 seasons for NFL offenses via espnscrapeR::scrape_team_stats_nfl().
# Get data from espnscrapeR
all_off <- 2000:2019 %>%
  map_dfr(scrape_team_stats_nfl)
glimpse(all_off)
Rows: 638
Columns: 25
$ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
$ team <chr> "St. Louis Rams", "Denver Broncos", "Indian…
$ games <int> 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16,…
$ pts_game <dbl> 33.8, 30.3, 26.8, 24.2, 24.8, 29.9, 22.9, 2…
$ pts_total <int> 540, 485, 429, 388, 397, 479, 367, 355, 315…
$ plays_scrimmage <dbl> 1014, 1115, 1026, 1024, 958, 1023, 1080, 99…
$ yds_game <dbl> 442.2, 409.6, 383.8, 377.5, 372.6, 361.0, 3…
$ yds_play <dbl> 7.0, 5.9, 6.0, 5.9, 6.2, 5.6, 5.3, 5.6, 5.1…
$ first_down_g <dbl> 23.8, 23.9, 22.3, 20.9, 19.9, 21.1, 19.9, 2…
$ third_conv <int> 86, 97, 94, 84, 86, 89, 100, 75, 92, 97, 84…
$ third_att <int> 181, 218, 201, 202, 188, 206, 235, 204, 247…
$ third_pct <dbl> 0.48, 0.44, 0.47, 0.42, 0.46, 0.43, 0.43, 0…
$ fourth_conv <int> 8, 9, 9, 10, 8, 3, 5, 4, 9, 10, 4, 11, 7, 4…
$ fourth_att <int> 13, 17, 10, 20, 13, 8, 14, 13, 12, 17, 10, …
$ fourth_pct <dbl> 0.62, 0.53, 0.90, 0.50, 0.62, 0.38, 0.36, 0…
$ penalty <int> 111, 89, 89, 134, 106, 118, 95, 118, 101, 1…
$ penalty_yds <dbl> 942, 792, 866, 1135, 908, 940, 703, 848, 91…
$ time_of_poss <dbl> 30.90000, 33.25000, 29.55000, 29.95000, 29.…
$ fumbles_total <int> 24, 26, 20, 22, 26, 18, 27, 23, 28, 26, 22,…
$ fumbles_lost <int> 12, 13, 14, 9, 11, 9, 14, 11, 13, 11, 12, 1…
$ turnover_ratio <int> -10, 19, -7, 2, -11, 17, 1, 3, 6, 9, 0, -5,…
$ stat <chr> "GAME_STATS", "GAME_STATS", "GAME_STATS", "…
$ season <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2…
$ season_type <chr> "REG", "REG", "REG", "REG", "REG", "REG", "…
$ role <chr> "offense", "offense", "offense", "offense",…
Because we are looking to run all the models at once, we'll need to take the data structure from wide to long so we can nest() the datasets by metric and run the model on each metric pair. By pivoting to long format we get our team-level data by season and metric, with the corresponding value for each season.
long_off_stats <- all_off %>%
  select(team, pts_game:third_pct, season) %>%
  mutate(season2 = season + 1) %>%
  pivot_longer(
    cols = c(-team, -season, -season2),
    names_to = "metric",
    values_to = "value"
  )
long_off_stats
# A tibble: 5,742 x 5
team season season2 metric value
<chr> <int> <dbl> <chr> <dbl>
1 St. Louis Rams 2000 2001 pts_game 33.8
2 St. Louis Rams 2000 2001 pts_total 540
3 St. Louis Rams 2000 2001 plays_scrimmage 1014
4 St. Louis Rams 2000 2001 yds_game 442.
5 St. Louis Rams 2000 2001 yds_play 7
6 St. Louis Rams 2000 2001 first_down_g 23.8
7 St. Louis Rams 2000 2001 third_conv 86
8 St. Louis Rams 2000 2001 third_att 181
9 St. Louis Rams 2000 2001 third_pct 0.48
10 Denver Broncos 2000 2001 pts_game 30.3
# … with 5,732 more rows
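As a sanity check, each team-season contributes one row per metric in long form, so the row count multiplies by the number of metrics (here 638 × 9 = 5,742, matching the tibble above). A toy sketch of the same pivot with hypothetical data:

```r
library(dplyr)
library(tidyr)

# hypothetical wide data: 2 team-seasons x 2 metrics
wide <- tibble(
  team     = c("A", "B"),
  season   = 2000,
  pts_game = c(33.8, 30.3),
  yds_game = c(442.2, 409.6)
)

long <- wide %>%
  mutate(season2 = season + 1) %>%
  pivot_longer(
    cols = c(-team, -season, -season2),
    names_to = "metric",
    values_to = "value"
  )

long
# 2 rows x 2 metrics = 4 rows in long form
```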
Next we need to join the data back onto itself to pair each metric's value in one year (value.x) with its value in the following year (value.y). The join disambiguates the two value columns by suffixing the left-hand side as value.x and the right-hand side as value.y. Technically we don't need season or season2 anymore, but I've kept them so we can do a quick visual check on the data. The numbers look good and are aligned properly!
join_years <- long_off_stats %>%
  inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>%
  select(everything(), -season2.y)
join_years
# A tibble: 5,436 x 6
team season season2 metric value.x value.y
<chr> <int> <dbl> <chr> <dbl> <dbl>
1 St. Louis Rams 2000 2001 pts_game 33.8 31.4
2 St. Louis Rams 2000 2001 pts_total 540 503
3 St. Louis Rams 2000 2001 plays_scrimmage 1014 1007
4 St. Louis Rams 2000 2001 yds_game 442. 418.
5 St. Louis Rams 2000 2001 yds_play 7 6.6
6 St. Louis Rams 2000 2001 first_down_g 23.8 22.3
7 St. Louis Rams 2000 2001 third_conv 86 96
8 St. Louis Rams 2000 2001 third_att 181 192
9 St. Louis Rams 2000 2001 third_pct 0.48 0.5
10 Denver Broncos 2000 2001 pts_game 30.3 21.2
# … with 5,426 more rows
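Note that the inner join silently drops any row with no following-season partner, which is why the row count shrinks here (for instance, the most recent season has nothing to join to). If you want to inspect what was dropped, anti_join() with the same keys surfaces those rows; a toy sketch with hypothetical data:

```r
library(dplyr)

# hypothetical toy data: one team across three seasons
toy <- tibble(team = "A", season = 2000:2002, value = c(10, 12, 11)) %>%
  mutate(season2 = season + 1)

# rows with no following-season match (here: the last season)
unmatched <- toy %>%
  anti_join(toy, by = c("season2" = "season", "team"))

unmatched
# just the 2002 row, since there is no 2003 season in the data
```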
We now need to nest the data, leaving metric out so that it is used as the group/label data. Now each of the metrics is separated into its own respective nested dataset!
nest_off_data <- join_years %>%
  nest(data = c(-metric))
nest_off_data
# A tibble: 9 x 2
metric data
<chr> <list>
1 pts_game <tibble [604 × 5]>
2 pts_total <tibble [604 × 5]>
3 plays_scrimmage <tibble [604 × 5]>
4 yds_game <tibble [604 × 5]>
5 yds_play <tibble [604 × 5]>
6 first_down_g <tibble [604 × 5]>
7 third_conv <tibble [604 × 5]>
8 third_att <tibble [604 × 5]>
9 third_pct <tibble [604 × 5]>
Now let's fit the models and summarize each one with broom::glance(). We now have the raw fit and the one-row glance() summary as list-columns! We're really just interested in r.squared for this example, so in the next step we'll pull that out by metric.
tidy_off_models <- nest_off_data %>%
  mutate(
    fit = map(data, ~ lm(value.y ~ value.x, data = .x)),
    tidy_output = map(fit, glance)
  )
tidy_off_models
# A tibble: 9 x 4
metric data fit tidy_output
<chr> <list> <list> <list>
1 pts_game <tibble [604 × 5]> <lm> <tibble [1 × 11]>
2 pts_total <tibble [604 × 5]> <lm> <tibble [1 × 11]>
3 plays_scrimmage <tibble [604 × 5]> <lm> <tibble [1 × 11]>
4 yds_game <tibble [604 × 5]> <lm> <tibble [1 × 11]>
5 yds_play <tibble [604 × 5]> <lm> <tibble [1 × 11]>
6 first_down_g <tibble [604 × 5]> <lm> <tibble [1 × 11]>
7 third_conv <tibble [604 × 5]> <lm> <tibble [1 × 11]>
8 third_att <tibble [604 × 5]> <lm> <tibble [1 × 11]>
9 third_pct <tibble [604 × 5]> <lm> <tibble [1 × 11]>
Now we have a few options: we can use unnest_wider() to get ALL the model metrics, but again that's overkill for our example.
tidy_off_models %>%
  unnest_wider(tidy_output)
# A tibble: 9 x 14
metric data fit r.squared adj.r.squared sigma statistic
<chr> <lis> <lis> <dbl> <dbl> <dbl> <dbl>
1 pts_g… <tib… <lm> 0.185 0.184 3.98 137.
2 pts_t… <tib… <lm> 0.185 0.184 63.6 137.
3 plays… <tib… <lm> 0.160 0.159 42.8 115.
4 yds_g… <tib… <lm> 0.253 0.252 33.5 204.
5 yds_p… <tib… <lm> 0.192 0.191 0.456 143.
6 first… <tib… <lm> 0.303 0.302 1.93 261.
7 third… <tib… <lm> 0.0808 0.0792 11.1 52.9
8 third… <tib… <lm> 0.0613 0.0598 13.1 39.3
9 third… <tib… <lm> 0.179 0.177 0.0460 131.
# … with 7 more variables: p.value <dbl>, df <int>, logLik <dbl>,
# AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>
Instead we'll use tidyr::hoist(), which pulls specific components out of a list-column. In this case, we are extracting just the r.squared value for each respective metric and then arranging by r.squared. Full details of unnest vs hoist can be found at the tidyr site.
off_lm_output <- tidy_off_models %>%
  hoist(tidy_output, r.squared = "r.squared") %>%
  arrange(desc(r.squared))
off_lm_output
# A tibble: 9 x 5
metric data fit r.squared tidy_output
<chr> <list> <list> <dbl> <list>
1 first_down_g <tibble [604 × 5]> <lm> 0.303 <tibble [1 × 10…
2 yds_game <tibble [604 × 5]> <lm> 0.253 <tibble [1 × 10…
3 yds_play <tibble [604 × 5]> <lm> 0.192 <tibble [1 × 10…
4 pts_game <tibble [604 × 5]> <lm> 0.185 <tibble [1 × 10…
5 pts_total <tibble [604 × 5]> <lm> 0.185 <tibble [1 × 10…
6 third_pct <tibble [604 × 5]> <lm> 0.179 <tibble [1 × 10…
7 plays_scrimmage <tibble [604 × 5]> <lm> 0.160 <tibble [1 × 10…
8 third_conv <tibble [604 × 5]> <lm> 0.0808 <tibble [1 × 10…
9 third_att <tibble [604 × 5]> <lm> 0.0613 <tibble [1 × 10…
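An equivalent extraction, if you prefer purrr, is map_dbl(), which plucks a length-one element out of each list-column entry and returns a plain double vector. A sketch on a small nested model set, using mtcars so the chunk stands alone:

```r
library(dplyr)
library(purrr)
library(tidyr)
library(broom)

toy_models <- as_tibble(mtcars) %>%
  nest(data = c(-cyl)) %>%
  mutate(
    fit     = map(data, ~ lm(mpg ~ wt, data = .x)),
    glanced = map(fit, glance)
  )

# hoist() pulls r.squared out of each one-row glance() tibble...
hoisted <- toy_models %>%
  hoist(glanced, r.squared = "r.squared")

# ...and map_dbl() extracts the same values as a double vector
mapped <- toy_models %>%
  mutate(r.squared = map_dbl(glanced, "r.squared"))
```

Both give identical r.squared columns; hoist() additionally removes the plucked component from the remaining list-column.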
So we just want the r.squared and metric values, plus a label we can use for ggplot down the road. Boom, we have the final output!
off_stability <- off_lm_output %>%
  select(metric, r.squared) %>%
  mutate(metric_label = glue::glue("{metric} (R^2 = {round(r.squared, 3)})"))
off_stability
# A tibble: 9 x 3
metric r.squared metric_label
<chr> <dbl> <glue>
1 first_down_g 0.303 first_down_g (R^2 = 0.303)
2 yds_game 0.253 yds_game (R^2 = 0.253)
3 yds_play 0.192 yds_play (R^2 = 0.192)
4 pts_game 0.185 pts_game (R^2 = 0.185)
5 pts_total 0.185 pts_total (R^2 = 0.185)
6 third_pct 0.179 third_pct (R^2 = 0.179)
7 plays_scrimmage 0.160 plays_scrimmage (R^2 = 0.16)
8 third_conv 0.0808 third_conv (R^2 = 0.081)
9 third_att 0.0613 third_att (R^2 = 0.061)
Now that may have seemed like a lot of code, mainly because we broke the example down step by step. Let's look at it all together now. We rearranged the dataset, nested it, ran 9 models, and got our outputs in one pipe with just a few lines of code!
(off_stability <- long_off_stats %>%
  inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>%
  select(everything(), -season2.y) %>%
  nest(data = c(-metric)) %>%
  mutate(
    fit = map(data, ~ lm(value.y ~ value.x, data = .x)),
    glanced = map(fit, glance)
  ) %>%
  hoist(glanced, r.squared = "r.squared") %>%
  arrange(desc(r.squared)) %>%
  mutate(metric_label = glue::glue("{metric} (R^2 = {round(r.squared, 3)})")))
Now there’s another advantage to getting data into this longer format!
We can combine the labels we generated with glue and the long-form data to plot ALL of the raw metrics at once with one ggplot call. Note I have added a reference line in light grey (slope = 1, i.e. a perfect year-over-year fit) and a red line for the lm fit of each dataset. That's all for this example, but hopefully it opened your eyes to a nest-map-unnest workflow, even for relatively simple models!
I’d definitely recommend trying out the rest of the tidymodels ecosystem for your more advanced machine learning and statistical analyses. You can learn all about it at the tidymodels.org site.
(off_plot <- long_off_stats %>%
  inner_join(long_off_stats, by = c("season2" = "season", "team", "metric")) %>%
  mutate(metric = factor(metric,
                         levels = pull(off_stability, metric),
                         labels = pull(off_stability, metric_label))) %>%
  ggplot(aes(x = value.x, y = value.y)) +
  geom_point(color = "black", alpha = 0.5) +
  geom_smooth(method = "lm", color = "#ff2b4f", se = FALSE) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 5)) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 5)) +
  geom_abline(intercept = 0, slope = 1, color = "grey", size = 1, alpha = 0.5) +
  facet_wrap(~metric, scales = "free") +
  labs(
    x = "\nYear Y Metric",
    y = "Year Y + 1 Metric\n",
    title = "Offense Stats - 2000-2019",
    subtitle = "Linear model fit comparing Year and Year + 1",
    caption = "Plot: @thomas_mock | Data: espnscrapeR"
  ) +
  theme_minimal() +
  theme(
    strip.background = element_rect(fill = "black"),
    strip.text = element_text(face = "bold", color = "white")
  ))
# optional output
# ggsave("plots/off_stability_plot.png", off_plot, height = 10, width = 10, dpi = 450)